Objective: introduce exploratory analysis and a tool for exploring variation in one variable.
The term and spirit of EDA is attributed to John Tukey, whose philosophically-leaning work in statistics in the 1960's and 1970's elaborated on the need for certain shifts in the standard disciplinary paradigms of statistics.
For a long time I have thought I was a statistician, interested in inferences from the particular to the general. (1962)
The classical statistical paradigm: sample $\longrightarrow$ estimate population properties $\longrightarrow$ test hypotheses. The starting point is often a set of assumptions.
All in all, I have come to feel that my central interest is in data analysis [which] is a larger and more varied field than inference. (1962)
Non-inferential techniques (like graphics and tabular summaries) can help to identify the particularities of a dataset, which may not conform to common statistical assumptions.
'Data analysis' ... allows us to use probability where it is needed and avoid it when we should. (1972)
An initial 'model-free' stage of investigation pays off, leading to more realistic models, or to alternative analyses when modeling is not an appropriate choice.
When inferences are the central goal of an analysis, data often serve the role of evidence for or against prespecified hypotheses.
For example, in the recent COVID-19 vaccine trials, the objective was to determine whether the vaccines were effective.
If the vaccine had little or no effect, observed cases would be split about 50-50 between the two groups (which were roughly equal in size).
| Vaccine group | Placebo group |
|---|---|
| 11 cases | 185 cases |
The estimated probability that a case was in the vaccine group is $11/196 \approx 0.056$ $\quad\Longrightarrow\quad$ evidence of effect.
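The estimate is just the observed proportion of cases in the vaccine group. As a quick sketch of the arithmetic:

```python
# Observed case counts from the table above
vaccine_cases = 11
placebo_cases = 185

# Estimated probability that a case came from the vaccine group
p_hat = vaccine_cases / (vaccine_cases + placebo_cases)
print(round(p_hat, 3))  # 0.056
```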
In many of the applications we've seen so far, data instead serve the role of information about some phenomenon.
To stick with the same example, the very same vaccine trial data were also used to assess safety by monitoring side effects. For this goal, there was no specific prespecified hypothesis (e.g., 'the vaccine causes headaches').

Tukey's ideas were a driving force in establishing a tradition of exploratory data analysis (EDA):
EDA stands in contrast to confirmatory data analysis (CDA):
The picture that most practitioners have of modern data science is that EDA precedes CDA.
In this view, EDA is used to establish a background and a plan by:
Then, CDA consists in:
Aside: Historically, statistics has focused on CDA -- and therefore a lot of your PSTAT coursework does, too.
Wickham encourages us to do EDA by asking questions:
When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make. (R for Data Science)
There are two basic kinds of questions that are always useful:
The primary tools of EDA are graphics and tables, with graphics arguably the more important of the two.
You've already seen a lot of examples in HW and lab of useful graphics and tables.
We will focus on question 1 this week -- variation within variables -- and introduce smoothing as a visual EDA technique.
Variation in data is the tendency of values to change from measurement to measurement.
The following are four measurements of particulate matter 2.5 microns or smaller in Aberdeen, South Dakota. They're around 8 $\frac{\mu g}{m^3}$, but each one is different.
pmdata.head(4)
| | PM25_mean | City | State | Year |
|---|---|---|---|---|
| 0 | 8.6 | Aberdeen | SD | 2000 |
| 1 | 8.6 | Aberdeen | SD | 2001 |
| 2 | 7.9 | Aberdeen | SD | 2002 |
| 3 | 8.4 | Aberdeen | SD | 2003 |
So what does it mean to ask what 'type' of variation there is in a variable?
Well, there isn't a methodical classification system for types of variation.
So, often the best one can do is to answer some useful questions about the values.
These questions often lead the way to more focused ones.
There are a lot of ways to describe variation quantitatively, but perhaps the simplest approach is to visualize the distribution of a variable's values.
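As a sketch of the idea, `numpy.histogram` bins a variable's values and counts how many fall in each bin. (The data below are synthetic stand-ins; the distribution and all numbers are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for PM 2.5 measurements (distribution made up)
pm25 = rng.gamma(shape=9, scale=1.0, size=1000)

# Bin the values to summarize the shape of the distribution
counts, edges = np.histogram(pm25, bins=10)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:5.1f}, {hi:5.1f}): {'#' * (c // 20)}")
```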
The following histogram shows the distribution of PM 2.5 concentrations from several thousand measurements taken in about 200 U.S. cities over a 20-year period.
hist1a
The national standard is 12 micrograms per cubic meter. Over 1,000 measurements exceeded this. Was it just a few cities, or more widespread?
Maybe it would help to separate the years, showing the distribution across cities for each year, to get a sense of how many cities exceed the benchmark in a typical year.
hist_annual
But there are a lot of years, and it's hard to understand the overlapping bars.
Visually, it's a lot easier to distinguish overlapping lines than overlapping bars. Smoothing out the histogram produces this:
density
This shows the variation in PM 2.5 is diminishing over time, and the typical values are getting smaller. Good news!
These curves are known as kernel density estimates, and they are often useful for answering questions about variation in EDA.
To explain how they're constructed, we'll first need to talk about rescaling histograms. Then we can get to smoothing.
You might recall from 120A that a probability density/mass function has two properties: (1) it is nonnegative everywhere, and (2) it integrates (or sums) to one.
Histograms are almost proper density functions: they satisfy (1) but not (2).
A simple rescaling can fix this. You may appreciate why this is desirable in more depth later on.
For now, a practical view is that this ensures that any two histograms are on comparable scales.
Typically, the bar height is simply a count of the number of observations in each bin. This is a count scale histogram.
If the values are $x_1, \dots, x_n$, then the height of the bar for the $j$th bin $B_j = (a_j, b_j]$ is: $$height_j = \sum_{i = 1}^n \mathbf{1}\{x_i \in B_j\} = \# \{x_i \in B_j\}$$
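The formula above can be computed directly: the height of each bin is a sum of indicators, one per observation. (The data values and bins below are made up for illustration.)

```python
import numpy as np

# Made-up data values and bins of width 1 (illustration only)
x = np.array([8.6, 8.6, 7.9, 8.4, 12.3, 9.1])
bin_edges = np.arange(7, 14)   # bins B_j = (a_j, a_j + 1]

# height_j = sum_i 1{x_i in B_j}: add up one indicator per observation
heights = np.array([np.sum((a < x) & (x <= b))
                    for a, b in zip(bin_edges[:-1], bin_edges[1:])])
print(heights)  # [1 3 1 0 0 1]
```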
While intuitive, this makes the bar heights incomparable in scale whenever the bin width or the sample size changes.
hist1a + hist2a
A fix that ensures comparability of scale for any two histograms is to normalize heights by bin width and sample size.
$$height_j = \color{red}{\frac{1}{nb}} \sum_{i = 1}^n \mathbf{1}\{x_i \in B_j\} = \frac{1}{nb}\,\# \{x_i \in B_j\} \quad\text{where}\quad b = |B_j|$$Now the area under the histogram is $\sum_j |B_j|\times height_j = 1$.
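Continuing the sketch with made-up values: dividing the counts by $nb$ makes the total area come out to exactly 1.

```python
import numpy as np

# Same made-up values as the count-scale example; bins of common width b
x = np.array([8.6, 8.6, 7.9, 8.4, 12.3, 9.1])
n, b = len(x), 1.0
bin_edges = np.arange(7, 14, b)

counts = np.array([np.sum((lo < x) & (x <= hi))
                   for lo, hi in zip(bin_edges[:-1], bin_edges[1:])])
density_heights = counts / (n * b)   # normalize by sample size and bin width

print(np.sum(density_heights * b))   # total area: 1.0
```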
So we call this a density scale histogram, because it is a valid probability density.
Here are two density scale histograms with different choices of bin width. The bar heights are now on comparable scales.
hist1b
hist2b
Kernel density estimates are local smoothings of the density scale histogram.
This can be seen by comparing the type of smooth curve we saw earlier with the density scale histogram.
hist1b.properties(height = 300, width = 300) + kde1b
Technically, KDE is a form of convolution filtering (if that means anything to you).
We can try to understand it in more intuitive terms by developing the idea constructively from the density histogram in two steps.
In what follows we're going to express the histogram mathematically as a function of data values.
This will require the use of indicator functions, which are simply functions that are 1 or 0 depending on a condition: $\mathbf{1}\{\text{condition}\}$ equals 1 when the condition holds and 0 otherwise.
Indicators are useful for expressing counts because the sum of an indicator gives the number of times the condition is met.
For example, given values $x_1, x_2, \dots, x_n$, the sum $\sum_{i = 1}^n \mathbf{1}\{x_i \in A\}$ counts how many of the values fall in the set $A$.
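A quick illustration with made-up values: comparison operators produce exactly this kind of 0/1 indicator, and summing it yields a count.

```python
import numpy as np

x = np.array([8.6, 8.6, 7.9, 8.4, 12.3, 9.1])   # made-up values

# Indicator of the condition "x_i > 8": 1 where true, 0 where false
indicator = (x > 8).astype(int)
print(indicator)        # [1 1 0 1 1 1]
print(indicator.sum())  # 5 values meet the condition
```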
The value (height) of the density scale histogram at an arbitrary point $\color{red}{x}$ is $$\text{hist}(\color{red}{x}) = \frac{1}{nb} \sum_{i = 1}^n \sum_{j} \mathbf{1}\{\color{red}{x} \in B_j\} \mathbf{1}\{x_i \in B_j\}$$
Here's what those indicators do:
$$\mathbf{1}\{\color{red}{x} \in B_j\} \quad \text{picks out a bin}\;,\qquad \mathbf{1}\{x_i \in B_j\} \quad \text{picks out the data points in the bin}$$Here's a picture:

One could do something more adaptive by allowing the height at $\color{red}{x}$ to be a normalization of the count in a neighborhood of $x$ rather than in one of a fixed set of bins: $$\text{localhist}(\color{red}{x}) = \frac{1}{nb} \sum_{i = 1}^n \mathbf{1}\left\{|x_i - \color{red}{x}| < \frac{b}{2}\right\} = \frac{1}{nb}\#\{x_i \in B(\color{red}{x})\}$$
Let's call this a local histogram, because the height at any point $\color{red}{x}$ is determined relative to the exact location of $\color{red}{x}$.
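A minimal implementation of the local histogram formula, evaluated on a small grid (the data values are made up for illustration):

```python
import numpy as np

def localhist(x_grid, data, b):
    """Height at each grid point: count within b/2 of it, scaled by 1/(n*b)."""
    n = len(data)
    counts = np.array([np.sum(np.abs(data - x) < b / 2) for x in x_grid])
    return counts / (n * b)

data = np.array([8.6, 8.6, 7.9, 8.4, 12.3, 9.1])   # made-up values
grid = np.linspace(6, 14, 5)

print(localhist(grid, data, b=6.0))   # normalized neighborhood counts
```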


drawing.properties(width = 500, height = 300)
Here's what that would look like with $b = 6$:
localhist
Zooming in reveals that this is a step function:
localhist_detail
The local histogram approach is: $$\frac{1}{nb} \sum_{i = 1}^n \color{blue}{\mathbf{1}\left\{|x_i - \color{red}{x}| < \frac{b}{2}\right\}}$$
The portion in blue is known as a kernel function, because it gives the core components that are aggregated.
Since the kernel is either a 1 or 0, summing the kernel over the data points returns a count for any $\color{red}{x}$.
The local histogram is actually a kernel density estimate with a discrete kernel function.
Let's replace that with the function $\varphi$: $$\hat{f}(\color{red}{x}) = \frac{1}{nb} \sum_{i = 1}^n \color{blue}{\varphi\left(\frac{x_i - \color{red}{x}}{b}\right)}$$
where $\varphi$ is the standard Gaussian density: $$\varphi(z) = \frac{1}{\sqrt{2\pi}}\exp\left\{- \frac{z^2}{2}\right\}$$
This is the (Gaussian) kernel density estimate (KDE).
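The formula for $\hat{f}$ translates directly into code. Below is a sketch with made-up data; a numerical integral confirms the resulting curve is a valid density.

```python
import numpy as np

def gaussian_kde(x_grid, data, b):
    """Evaluate the Gaussian KDE f-hat at each point of x_grid."""
    n = len(data)
    # phi((x_i - x) / b) for every (grid point, data point) pair
    z = (data[None, :] - x_grid[:, None]) / b
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return phi.sum(axis=1) / (n * b)

data = np.array([8.6, 8.6, 7.9, 8.4, 12.3, 9.1])   # made-up values
grid = np.linspace(0, 20, 2001)

fhat = gaussian_kde(grid, data, b=1.0)
area = np.sum(fhat) * (grid[1] - grid[0])   # numerical integral of f-hat
print(round(area, 3))   # approximately 1: the KDE is a valid density
```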
In effect, the KDE curve at any point $\color{red}{x}$ is a weighted aggregation of all the data, with weights that decrease with each data point's distance from $\color{red}{x}$.
The Gaussian kernel function is really what's responsible for the smoothness of a KDE.
Let's compare the KDE with the local histogram:
localhist + kde1b
Almost identical, except the KDE is smooth.
There is a bandwidth parameter, above denoted $b$, that controls how wiggly the KDE curve is.
(hist1b + kde1b_highb) | (hist1b + kde1b_lowb)
The left panel is too smooth -- the KDE curve doesn't get the shape right.
But the right panel is too wiggly -- the KDE curve hardly smooths out the shape at all.
There are methods for determining 'optimal' choices of bandwidth, but usually it's picked by trial-and-error.
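One way to see the bandwidth's effect numerically: evaluate the same Gaussian KDE at several bandwidths and measure how wiggly each curve is via its total variation. (All data and bandwidth values below are made up for illustration.)

```python
import numpy as np

def kde(x_grid, data, b):
    # Gaussian KDE: average of scaled Gaussian bumps centered at the data
    z = (data[None, :] - x_grid[:, None]) / b
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * b * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
# Made-up bimodal sample: a large mode near 8 and a small one near 12
data = np.concatenate([rng.normal(8, 1, 200), rng.normal(12, 0.5, 50)])
grid = np.linspace(4, 16, 600)

for b in (0.1, 0.5, 2.0):
    curve = kde(grid, data, b)
    wiggle = np.abs(np.diff(curve)).sum()   # total variation: larger = wigglier
    print(f"b = {b}: total variation = {wiggle:.2f}")
```

Smaller bandwidths yield curves with more total variation, matching the visual impression of wiggliness in the panels above.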
It's important to be aware that the smoothing bandwidth can change the visual impression of what's going on in a plot and affect choices later in analysis.
density_higherb | density_lowerb
Here they mostly tell the same story about what happens over time, but the higher bandwidth obscures a few outliers.
Next question: which cities/years are the outliers? What, if anything, explains the abnormally high PM2.5?
In PSTAT 120A, you learned about random variables.
At their essence, random variables are models of the natural variation we see in life.
All those probability distributions you covered -- the binomial, Gaussian/normal, Poisson, uniform, gamma, etc. -- are theoretical constructs describing different types of variation. In other words, models.
Their density/mass functions show the distribution of values of a random variable. You can think of this as an abstraction of the histogram.
The conditional distribution of $Y$ given $X$ is the distribution of values of a random variable $Y$ when another random variable $X$ is fixed at some value.
When $X, Y$ are dependent, the conditional distribution of $(Y|X = x)$ changes with $x$.
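In code, examining a conditional distribution often amounts to grouping. Here is a sketch with hypothetical data shaped like pmdata (all numbers are made up), summarizing the conditional distribution of PM 2.5 given year:

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like pmdata (all numbers made up):
# PM 2.5 values that drift downward over the years by construction
rng = np.random.default_rng(2)
years = np.repeat([2000, 2010, 2020], 100)
pm = rng.gamma(shape=10, scale=1.0, size=300) - (years - 2000) * 0.15

df = pd.DataFrame({"Year": years, "PM25_mean": pm})

# Summarize the conditional distribution of PM2.5 given year
print(df.groupby("Year")["PM25_mean"].agg(["mean", "std"]))
```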
In our PM2.5 example, we were examining how the distribution of PM2.5 concentration differs by year.
density
In other words, we were looking at the conditional distribution of PM2.5, given year.
When exploring a dataset -- seeking to understand variation and co-variation -- usually the analyst is investigating conditional distributions (whether they know it or not!).
Think about this as we move next week into discussing techniques for exploring covariation.